Method for Producing and Identifying Soluble Protein Domains

Field of the Invention 

The present invention relates to methods for producing
and identifying fragments of proteins, and more
particularly fragments which are soluble domains of a
protein. The present invention further provides
libraries of expression vectors and host cells comprising
nucleic acid encoding the protein fragments, and libraries
of the protein fragments.

Background of the Invention 

There are many large soluble, transmembrane and integral
membrane multi-domain proteins of intense biomedical
interest. These substances are by definition potential
drug targets. Structural and functional analyses of
these proteins will provide the basis for design of new
strategies for therapeutic intervention in disease. High
resolution structural study of proteins provides a basis
for understanding biological and disease processes at
molecular and atomic levels that is often necessary to
support rational design or optimisation of new candidate
drugs.

Biochemical and functional assays are used in drug
discovery programs to identify compounds that interact
with proteins in a manner that interferes with the
biological function of the protein. These assays require
large quantities of soluble protein to allow screening of
thousands of compounds from chemical libraries. However,
the production of sufficient quantities of these large
proteins for detailed functional and structural studies
is rarely feasible using existing methods. In the rare
cases where sufficient quantities of large multi-domain
proteins can be produced, it is seldom possible to obtain
the protein crystals that are prerequisite to structural
study by X-ray crystallography or other techniques used
in the art such as NMR. Production of soluble fragments
of these proteins may, however, allow identification of
regions of a protein that are responsible for the
biological functions (or malfunctions), and facilitate
detailed structural and functional analysis. Production
of soluble protein fragments is therefore necessary to
allow in vitro biochemical and structural analyses of
multi-domain proteins that cannot be obtained in
sufficient quantities in intact form. Little is known,
though, about the domain structure and organisation of
many of these large proteins, and bioinformatics
approaches often do not provide a sufficient basis for
rational identification of candidate domains. As a
result, identification and expression of domains from
many of these large proteins have proved refractory to
the established, rational, recombinant protein
engineering/expression strategies.

There are currently three main empirical approaches to
identification of soluble protein domains: 1)
bioinformatics and sequence analysis to estimate the
location of domain boundaries of proteins based on
sequence similarities with known proteins; 2) proteolytic
fragmentation of the intact protein and identification of
soluble fragments (REF); and 3) generation of "random"
gene fragments, cloning to produce a gene fragment
library, and expression screening of the library to
identify clones expressing soluble, folded protein
fragments. Taken together, these methods suffer from a
number of weaknesses, such as a requirement for
quantities of the intact multi-domain protein for
fragmentation that often cannot be obtained, and failure
to isolate gene fragments capable of producing soluble
protein domains.

The most commonly used method for identification of
minimal protein domains (domain-mapping) involves limited
proteolysis of a target protein and identification of
proteolytically resistant fragments by mass spectrometry
(e.g. Cohen, S. L. (1996)). This approach is based on
the assumption that stable, folded domains are likely to
be more resistant to proteolysis than the unstructured
regions of peptide sequence that are often found between
domains. The approach usually requires a reasonable
quantity of highly purified, intact, soluble target
protein derived from the native biological source, yet a
large proportion of human proteins of biomedical interest
cannot be obtained in sufficient quantities. Protein
samples are enzymatically fragmented using various
proteases. The molecular masses of the protein fragments
generated are then measured by mass spectrometry, and the
identity of the fragments may be confirmed by further
fragmentation (i.e. protein sequencing by MS). It is
then assumed that protein fragments of around sixty or
more amino acid residues in length represent stably
folded domains, since these portions of the protein appear
to have greater resistance to degradation by proteases.
This information is then used to design expression
vectors for recombinant expression of the soluble domain
candidates identified above.

In practice, there are several caveats with this approach
that may result in failure to detect individual protein
domains. The cleavage specificity of proteases is
limited to the peptide bond between certain amino acid
residue types (e.g. trypsin cleaves the peptide bond to
the C-terminal side of basic residues). The position of
protease cleavage sites is therefore not a function
solely of structural context, but also of amino acid
sequence context. Thus, if the appropriate amino acid
types are not found in a particular inter-domain peptide
sequence, then the adjacent domains may not be separated
and the individual domains would not be identified. In
addition, steric hindrance may prevent protease-mediated
cleavage of inter-domain peptide sequences that are short
in length. Another major caveat of these approaches is
that many domains comprise flexible loop regions that may
be proteolytically sensitive, resulting in cleavage
within a domain (i.e. failure to detect the correct
boundaries of a domain). Finally, a peptide sequence
that corresponds to a soluble, folded proteolytic
fragment may not necessarily be capable of autonomous
folding, and therefore recombinant over-expression of
this particular peptide sequence may fail to produce
soluble protein of tertiary structural integrity.
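
The sequence dependence of protease cleavage described above
can be illustrated with a minimal sketch. It assumes the
simplest rule for trypsin (cut after every lysine or arginine,
ignoring any further sequence effects); the function names and
the example linker are hypothetical and serve only to show why
an inter-domain linker lacking basic residues would not be cut.

```python
def tryptic_cut_positions(protein):
    """Positions after which a simplified trypsin rule would cut.

    Assumption: cleavage occurs C-terminal to every K or R,
    except at the very last residue (nothing follows it).
    """
    return [i + 1 for i, aa in enumerate(protein[:-1]) if aa in "KR"]


def digest(protein):
    """Split a one-letter-code protein sequence at the cut positions."""
    cuts = [0] + tryptic_cut_positions(protein) + [len(protein)]
    return [protein[a:b] for a, b in zip(cuts, cuts[1:])]


def linker_is_cleavable(linker):
    """True if the linker contains at least one basic residue (K or R)."""
    return any(aa in "KR" for aa in linker)


if __name__ == "__main__":
    print(digest("AAAKGGGRCCC"))          # three pieces, split after K and R
    print(linker_is_cleavable("GSGSGS"))  # hypothetical linker with no K/R
```

Under this simplified rule, two domains joined by a linker
without K or R would remain attached after digestion, however
exposed the linker is.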

A DNA fragmentation based domain-mapping/identification
method requires a protocol for generation of DNA
fragments from an intact coding sequence in a manner that
allows essentially random sampling of all possible
fragments of appropriate size range (i.e. of a size
capable of coding for a domain, ~200-1500 nucleotides).
In addition, the fragmentation protocol should ideally be
generically reproducible, and must therefore be
independent of differences in the properties of
particular DNA targets, and produce fragments that are
compatible with conventional methods for cloning of DNA
into vectors for protein expression. However, none of
the existing DNA fragmentation methods fully meets the
requirements of random sampling and generic
reproducibility: they often display biased sampling,
and/or require optimisation of the method for particular
target DNA properties such as DNA chain-length, and/or
produce fragments that are incompatible with subsequent
cloning applications. This is not surprising, as many
methods for fragmenting large DNA molecules have been
developed for a wide variety of purposes other than
protein domain identification.

WO 03/040391    PCT/GB02/05075

A DNA fragmentation based domain-mapping/identification
method requires a method for cloning of the DNA fragment
mixture to produce a library of the gene fragments. A
screening assay must then be used to identify clones that
produce soluble folded protein fragments. A number of
approaches have been developed for generation of
libraries of different clones for a range of purposes
including: large-scale DNA sequencing projects (e.g.
shotgun cloning); selection of mutant proteins with
particular enhanced functional properties (e.g. using
gene-shuffling or random mutagenesis); and identification
of epitopes for monoclonal antibodies by selection from a
phage-display peptide library. Established library-based
approaches to selection of protein variants or mutants
have recently been adapted to identification of domains
in large proteins, including for example: a) cloning of
DNA fragments into a bacteriophage surface-expression
vector for expression as fusions with bacteriophage
structural proteins (phage-display), using affinity
selection as readout; b) cloning of DNA fragments into
expression vectors to produce fusions with a reporter
gene such as GFP or an antibiotic resistance gene, using
fluorescence and antibiotic resistance respectively as
readout of recombinant protein solubility in vivo.

Phage display approaches involve enzymatic fragmentation
of coding DNA and cloning of these fragments into a
bacteriophage surface-expression vector to produce a
phage display library of clones expressing different gene
fragments on their surface. A method has been described
involving shotgun cloning coupled with phage display
mapping of functional domains of two streptococcal
cell-surface proteins (Jacobson et al., 1997). A
phage-display library may be screened using a number of
different approaches such as: target protein specific
affinity selection and DNA sequencing of clones to
identify the minimal fragment that retains binding
affinity (e.g. Moriki et al., 1999); or surface
immobilisation of phage clones followed by limited
proteolysis and washing to identify recombinant
bacteriophage clones that are most resistant to
proteolysis and are likely to display a fragment that has
tertiary structure (Finucane et al., 1999). A limitation
of affinity selection methods for screening of fragment
libraries is a requirement for knowledge of the binding
affinity(s) of the target protein, since this excludes
the large number of proteins for which no specific
binding or enzymatic activity has yet been established.
Screening by limited proteolysis of phage particles
adhered to a surface also suffers from the same caveats
as the other limited proteolysis approaches described
above.

"Random PCR" has been used to generate fragments of
target coding sequence for screening for soluble domains
as fusions with green fluorescent protein (Kawasaki and
Inagaki, 2001). Caveats with this approach include:
"random PCR" is not truly random and will therefore not
produce a complete library of all possible gene fragments
of the appropriate size range; and attachment of GFP to
the expressed gene fragment may affect the folding and
solubility of particular candidate domains, resulting in
both false negative and false positive results. An in
vivo method for improvement of the solubility of proteins
and protein domain constructs has been described
involving mutagenesis of target proteins, production of
fusions of target proteins with the antibiotic resistance
gene chloramphenicol acetyl transferase, and selection of
clones with enhanced resistance to chloramphenicol
(Maxwell et al., 1999). This method has not been used
for domain identification. A caveat with this method is
that there is only limited discrimination between soluble
and insoluble proteins, and the method does not select
between folded and misfolded soluble fusions. An in vivo
structural complementation based assay has been described
involving fusions of the alpha fragment of
beta-galactosidase with the C-terminus of target
proteins, so that if the fusion protein proves to be
insoluble then interaction with the omega subunit will be
prevented, resulting in loss of beta-galactosidase
activity (Wigley et al., 2001).

In summary, phage-display and fusion protein based
methods have the common caveat that attachment of a
reporter protein to a test protein is likely to influence
the folding and solubility of the test protein in an
unpredictable and target protein specific manner. In
practice, existing DNA fragmentation approaches are not
ideal for protein domain identification methods, as none
of these fully meets the requirements of random sampling,
generic reproducibility and compatibility with subsequent
cloning applications. In addition, all existing methods
for domain identification, including limited proteolysis,
gene fragmentation based methods such as phage display,
and fusion protein based screening methods, have serious
limitations. These undoubtedly lead to failure to detect
some protein domains and failure to identify the domains
or regions of protein that are responsible for biological
activities that could become the new targets for
therapeutic intervention and drug development.

Summary of the Invention 

Broadly, the present invention relates to methods for
producing and identifying fragments of proteins, and more
particularly to methods for generating and identifying
soluble protein domains. In preferred aspects, the
present invention is based on two innovative methods: 1)
one relates to a method for generating a library of
nucleic acid fragments from nucleic acid encoding a
desired polypeptide, and more especially a library of
essentially randomly sampled fragments of coding DNA
sequence predominantly of defined size range; and 2) a
second relates to a method for selecting cloned gene
fragments from the library that encode soluble protein
domains.

In preferred embodiments, the present invention provides
a holistic empirical method for the preparation and
identification of regions of protein sequence that
correspond to minimal domains or larger soluble fragments
(e.g. several domains) and also permits production of
these fragments in a form that is compatible with the
structural and functional analyses identified above.

Accordingly, in a first aspect, the present invention
provides a method for producing a library of nucleic acid
fragments, the nucleic acid fragments encoding one or
more portions of a polypeptide, the method comprising:
    amplifying a nucleic acid sequence encoding the
polypeptide in the presence of a non-native nucleotide so
that the non-native nucleotide is incorporated into an
amplified product nucleic acid sequence at a frequency
related to the relative amounts of the non-native
nucleotide and its corresponding native nucleotide, if
present; and
    contacting the product nucleic acid sequence with
one or more reagents capable of recognising the presence
of the non-native nucleotide and cleaving the product
nucleic acid sequence or excising the non-native
nucleotide, thereby producing nucleic acid sequences
encoding fragments of the polypeptide.

In a further aspect, the present invention provides a
library of nucleic acid sequences encoding fragments of
the polypeptide produced by the methods described herein.

In the present invention, "a non-native nucleotide" is a
deoxynucleotide other than deoxyadenine (dA),
deoxythymidine (dT), deoxycytosine (dC) or deoxyguanine
(dG) that can replace the corresponding native nucleotide
and is recognisable by the reagent used to cleave the
product nucleic acid sequence or excise the non-native
nucleotide from the product nucleic acid sequence.
Preferably, the non-native nucleotides are neutral in
terms of coding and are non-mutagenic. Examples of
non-native nucleotides include uracil, which can be used
to replace thymidine, and 3-methyl adenine, which can be
used to replace adenine.

Preferably, the amplification of the nucleic acid
sequence is carried out by PCR using a non-native
deoxynucleotide, either alone or in a mixture of the
non-native and native nucleotide.

The starting nucleic acid sequence employed in the method
may be a nucleic acid sequence encoding one or more
polypeptide(s). In other embodiments, the starting
nucleic acid comprises a cDNA or RNA library, or genomic
DNA.

Preferably, the method comprises the further step of
ligating the nucleic acid sequences encoding fragments of
the desired polypeptide sequence into expression
vector(s) to provide a library of expression vectors, and
the optional further step of transforming host cells with
the expression vectors to produce a library of host cells
capable of expressing fragments (domains) of the
polypeptide.

The method for generating random gene fragments involves
random incorporation of a non-native nucleotide into the
product nucleic acid sequence, in place of a native
nucleotide, at a frequency that is preferably determined
by the molar ratio of non-native to native nucleotide
used in preparation of the coding sequence. The
amplified nucleic acid product is then preferably
contacted with a reagent capable of recognising and
cleaving the sequence at the non-native nucleotide, for
example by using an enzyme such as a DNA glycosylase or
endonuclease, which can recognise the presence of the
non-native nucleotide and cleave the nucleic acid
sequence at or around the non-native nucleotide. A
preferred protocol employs enzyme(s), β-elimination and
temperature changes in order to generate DNA fragments
derived by essentially unbiased sampling and
predominantly of defined size range. The method of the
present invention is particularly advantageous as it
allows the production of nucleic acid fragments of a size
which encode protein domains of the polypeptide, e.g.
preferably between 100 and 1500 nucleotides, more
preferably between 200 and 1200 nucleotides, and most
preferably between 300 and 1000 nucleotides in length,
and is capable of fine sampling of the nucleic acid
encoding the polypeptide, producing fragments on average
every second nucleotide. In order to allow generic
application for library sampling of any polypeptide, it
may be advantageous to re-code certain nucleotide
sequences to contain more incorporation sites for the
non-native nucleotide, up to the limits imposed by the
constraints of the genetic code.

Optionally, the nucleic acid fragments may then be
further amplified to produce nucleic acid fragments for
further uses. Additionally or alternatively, the nucleic
acid fragments may be exposed to enzymes that mediate
attachment of the fragments to other DNA molecules, such
as an expression vector, to produce gene fragment
expression constructs. Such DNA molecules comprise
sequences responsible for control of transcription and
translation of the gene fragments, optionally sequence
encoding affinity tag peptide sequences, and optionally
sequence for replication of the derived DNA constructs in
host cells.

Thus, in a preferred embodiment, the present invention
provides a method of producing fragments of a desired
polypeptide, the method comprising expressing the nucleic
acid sequences encoding fragments of the desired
polypeptide and optionally isolating the polypeptide
fragments thus produced. Preferably, the polypeptide
fragments are expressed as fusions with an affinity tag,
so that they can be purified by affinity chromatography.
Preferably, peptide based affinity tags will be less than
25 amino acid residues long, and more preferably less
than 15 residues long. Preferred affinity tags have
minimal effect on the solubility, stability and/or
aggregation state of the attached protein fragment. The
use of C-terminal affinity tags is preferred as this
permits the selection of clones that express in-frame
fragments of DNA, while DNA fragments which are
out-of-frame would tend to terminate prior to the
translation of the tag.
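
The reading-frame argument for C-terminal tags can be sketched
as follows. This is an illustrative model only: it assumes the
vector's start codon sits immediately upstream of the inserted
fragment and the tag coding sequence immediately downstream,
and the function names are hypothetical. Under those
assumptions the tag is translated only if the fragment contains
no in-frame stop codon and its length keeps the tag in frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_in_frame_stop(fragment):
    """True if any complete codon read from position 0 is a stop codon."""
    codons = (fragment[i:i + 3] for i in range(0, len(fragment) - 2, 3))
    return any(codon in STOP_CODONS for codon in codons)


def tag_is_translated(fragment):
    """Assumed model: start codon directly upstream, tag directly downstream.

    The tag is read correctly only when translation runs through the
    whole fragment (no stop codon) and the fragment length is a
    multiple of three, so that the tag stays in the same frame.
    """
    return len(fragment) % 3 == 0 and not has_in_frame_stop(fragment)


if __name__ == "__main__":
    for insert in ("GCCAAAGGT", "GCCTAAGGT", "GCCAAAGG"):
        print(insert, tag_is_translated(insert))
```

In this model, a clone that yields tagged protein is one whose
insert is read through in frame, which is the property the
C-terminal tag is being used to select for.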

Examples of suitable affinity tags include polyhistidine
(e.g. the hexa-His tags exemplified herein) which bind to
metal ions such as Ni2+ or Co2+, Flag or Glu epitopes which
bind to anti-Flag antibodies, S-tags which bind to
streptavidin, calmodulin binding peptide which binds to
calmodulin in the presence of Ca2+, and ribonuclease S
which binds to aporibonuclease S. Examples of other
affinity tags that can be used in accordance with the
present invention will be apparent to those skilled in
the art.

In a further aspect, the present invention provides a
library, e.g. as produced by a method described
herein, which is:
    (a) a library of nucleic acid fragments of a parent
nucleic acid sequence, wherein the nucleic acid fragments
have a size range as disclosed herein and are preferably
sampled from the parent nucleic acid sequence on average
about every second nucleotide; or
    (b) a library of expression vectors which comprise
a plurality of the nucleic acid fragments as set out in
(a), wherein each fragment is ligated to a nucleic acid
sequence encoding an affinity tag and optionally one or
more further sequences to direct the expression of the
nucleic acid fragment and the affinity tag; or
    (c) a library of host cells transformed with the
expression vectors as defined in (b); or
    (d) a library of polypeptide fragments produced by
expressing the nucleic acid sequences, wherein each
polypeptide is coupled to an affinity tag.

Preferably, this method makes use of non-native
nucleotides, and in particular non-native nucleotide
bases that can be randomly incorporated into the DNA
duplex and then selectively excised to produce the
nucleic acid fragments of the polypeptide. None of the
current enzymatic methods reviewed above that aim to
produce DNA fragments of essentially random distribution
with respect to the source DNA (e.g. DNAase I digestion)
provides robust control of fragment size range, sampling
of DNA in a manner fully independent of DNA secondary
structure, or robust reproducibility. In contrast, the
present method preferably provides fine sampling with
cleavage every second nucleotide on average, robust
control of fragment size range, rapid and facile
execution, and robust reproducibility. The DNA produced
by the method is also compatible with blunt ended and TA
cloning methods for construction of expression vectors.

In a preferred embodiment, the present invention employs
a DNA fragmentation method based upon the enzymatic DNA
base-excision repair pathway (Savva et al., 1995; Savva
and Pearl, 1995; Panayotou et al., 1998; Barrett et al.,
1998; Barrett et al., 1999). This system initiates the
removal of uracil, the pro-mutagenic deamination product
of cytosine, from DNA by hydrolysis of the bond linking
the base to the sugar, followed by cleavage of the sugar
phosphate backbone at the abasic site by an
apurinic/apyrimidinic endonuclease (APE). The initial
reaction, catalysed by uracil-DNA glycosylase (UDG), is
exquisitely specific for uracil and proceeds with very
high efficiency. Thus, exposure to UDG and APE enzymes
produces a single-strand nick in a dsDNA molecule
wherever a uracil occurs. Like the normal DNA component
thymine (identical to 5-methyl-uracil), uracil forms
stable Watson-Crick base pairs with adenine, and can be
efficiently introduced into dsDNA by template-dependent
DNA polymerase reactions, using Pol I family enzymes such
as Taq in PCR reactions. The widely used archaeal DNA
polymerases such as Pfu or Vent are inhibited by template
strand uracil (Greagg et al., 1999) and are not suitable
for this purpose. Incorporation of uracil opposite a
template-strand adenine occurs with comparable efficiency
to incorporation of thymine, and is unbiased by sequence
context. Thus, the probability of uracil incorporation
in the daughter strand opposite a template-strand adenine
is purely a function of the ratio of TTP/dUTP present in
the PCR reaction mix and is independent of uracil
incorporation in previous cycles. The product of an
'ideal' TTP/dUTP PCR reaction is a mixture of otherwise
identical double-stranded DNA molecules in which each
possible thymine in either strand has been replaced by
uracil. PCR under these conditions is robust even for
relatively large PCR products. When this reaction
mixture is exposed to UDG and APE to completion,
single-strand breaks are introduced at each position at
which a uracil was incorporated. A typical mammalian
genome has a thymine content of ≈ 25%; therefore double
stranded DNA fragments are generated beginning and ending
≈ every 2nd base, since cleavage may occur at uracil
sites on both coding and non-coding strands.
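
As a rough worked estimate of how the nucleotide ratio sets
the fragment size, consider the following simplified model.
It is an illustration under stated assumptions rather than
part of the method as described: uracil is incorporated
independently at each thymine site with probability p, every
incorporated uracil on either strand ultimately yields a
fragment end, and f_T denotes the fraction of thymine in each
strand. The symbols p, q, f_T and \bar{L} are introduced here
for illustration only.

```latex
% p: probability that a given thymine site carries uracil
% q: per-position probability of a fragment end in the duplex
% \bar{L}: approximate mean spacing between fragment ends
p = \frac{[\mathrm{dUTP}]}{[\mathrm{dUTP}] + [\mathrm{TTP}]},
\qquad
q \approx 2 f_T \, p,
\qquad
\bar{L} \approx \frac{1}{q} = \frac{1}{2 f_T \, p}

% Example: f_T = 0.25 and TTP/dUTP = 100:1 give p = 1/101, so
\bar{L} \approx \frac{1}{2 \times 0.25 \times \tfrac{1}{101}} = 202
\ \text{nucleotides}
```

The factor 2 f_T ≈ 0.5 reflects the statement above that
roughly every second base is a possible fragment end, while p
controls how often a possible site is actually used.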

Cleavage by APE leaves a deoxyribose phosphate moiety at
the 3' or 5' side of the nick, depending on the
specificity of the APE used. The deoxyribose phosphate
moiety may then be removed by β-elimination, which is
accelerated by mild bases such as spermine and elevated
temperature (Bailly and Verly, 1989), to produce single
nucleotide gaps in one strand of the duplex. In order to
produce blunt-ended DNA fragments for cloning, two
alternative approaches may be used: 1) cleavage of the
single-stranded DNA opposite the single-nucleotide gaps
in the duplex DNA using S1 nuclease (Vogt, 1973) (Figure
1); 2) thermal denaturation of the duplex DNA and
re-annealing of the DNA at reduced temperature and filling
of 3' recesses using a template dependent DNA polymerase,
followed by removal of 3' extensions using a
single-strand specific exonuclease with 3'-5' exonuclease
activity such as Mung bean nuclease or a single-strand
specific endonuclease such as S1 nuclease (Figure 2).

This DNA fragmentation method has several advantages over
other possible methods. Firstly, given pure reagent
enzymes, every enzymatic step can be allowed to go to
completion, so that the size distribution of the
fragments generated is dictated solely by the TTP/dUTP
ratio used in the original PCR reaction. This is in
contrast to other enzymatic digestion approaches, such as
cleavage by endonucleases (e.g. DNAase I) that cleave both
strands of duplex DNA, which fully degrade DNA to free
nucleotides if the digestion is allowed to go to
completion. Computer simulations of the present method
using a 5120 base pair gene suggest that a TTP/dUTP ratio
of 100:1 will give even cover of the coding sequence, and
good representation of fragments in the desired 'domain'
size range (~300-1000 nucleotides). Secondly, all the
procedures involved are enzymatic and therefore carried
out under 'mild' conditions that will cause no other DNA
damage, and are completely compatible with rapid,
efficient DNA purification methods, such as ion-exchange
and silica-based adsorption methods, that may be used
between subsequent steps. Thirdly, the products of these
reactions are fully 'biological' and suitable for cloning
into expression vectors by blunt-end ligation or
topoisomerase I-mediated ligation.
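
A simulation of the kind mentioned above can be sketched along
the following lines. This is not the simulation referred to in
the text, whose details are not given; it is a minimal Monte
Carlo model under loudly stated assumptions: uracil replaces
thymine independently with probability 1/(1 + TTP/dUTP), every
uracil on either strand becomes a double-strand break after
complete digestion, and the 5120-nucleotide test sequence is
random. All function names and defaults are hypothetical.

```python
import random


def fragment_once(seq, ttp_to_dutp, rng):
    """Return (start, end) intervals for one fully digested molecule.

    Assumption: a break may arise at any T (uracil in this strand)
    or any A (uracil in the complementary strand).
    """
    p_u = 1.0 / (1.0 + ttp_to_dutp)  # chance a thymine site carries uracil
    breaks = [i for i, base in enumerate(seq)
              if base in "AT" and rng.random() < p_u]
    cuts = [0] + breaks + [len(seq)]
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if b > a]


def simulate(seq, ttp_to_dutp=100.0, molecules=1000,
             size_range=(300, 1000), seed=0):
    """Pool fragments from many molecules, keeping the chosen size range."""
    rng = random.Random(seed)
    low, high = size_range
    kept = []
    for _ in range(molecules):
        for start, end in fragment_once(seq, ttp_to_dutp, rng):
            if low <= end - start <= high:
                kept.append((start, end))
    return kept


if __name__ == "__main__":
    gene_rng = random.Random(1)
    gene = "".join(gene_rng.choice("ACGT") for _ in range(5120))
    fragments = simulate(gene)
    starts = {start for start, _ in fragments}
    print(len(fragments), "fragments in range,",
          len(starts), "distinct start positions")
```

Varying `ttp_to_dutp` in such a sketch shows how the ratio
shifts the size distribution, and counting distinct start
positions gives a crude measure of how evenly the coding
sequence is covered.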

It would also be possible to employ a different
non-native nucleotide and use a corresponding enzyme which
is capable of recognising the non-native nucleotide in the
amplified nucleic acid sequence and removing it from the
amplified nucleic acid sequence or cleaving the sequence,
thereby generating the fragments. One example is
3-methyladenine-DNA glycosylase from E. coli, which is
another monospecific DNA glycosylase that could also be
used if deoxy-3-methyladenine (3-meA) mononucleotides are
incorporated instead of deoxyadenine (both form base
pairs with thymidine). This nucleotide could be
generated by exposing deoxyadenine mononucleotides to the
methylating agent methyl methanesulphonate (MMS) and
re-purifying them.

In many circumstances, it will be desirable to generate
'ragged-terminus' libraries in which, for example, a
domain such as an N-terminal domain is always present,
but a wide range of C-termini are to be sampled. This
can be readily achieved using the method by performing
two PCR steps and a thermal denaturation and annealing
step: 1) amplification of the constant 5' segment
encoding the N-terminus in a TTP PCR reaction; 2)
amplification of a 3' segment that partially overlaps
with the 5' segment in a TTP/dUTP PCR reaction; and 3)
mixing the products of these two PCR reactions before
thermal melting and re-annealing. A restriction
endonuclease (RE) site that generates a 'sticky end' on
cleavage may be introduced into the 5' extremity of the
5' segment, so that the library of N-terminally constant
but C-terminally ragged coding sequences can then be
efficiently cloned into a vector cleaved with the above
RE and with a second RE that generates a blunt end.

In a further aspect, the present invention provides a
method of identifying soluble protein domains, the method
comprising:
    expressing a library of nucleic acid fragments to
produce the protein domains encoded by the fragments,
wherein the protein domains are expressed as fusions with
an affinity tag; and
    separating soluble proteins using the affinity tag.

Examples of affinity tags that can be employed in the
present invention are provided above and many others will
be apparent to the skilled person. The use of C-terminal
affinity tags is preferred as this permits the selection
of clones that express in-frame fragments of DNA, while
DNA fragments which are out-of-frame would tend to
terminate prior to the translation of the tag.

The method may comprise the additional step of
identifying soluble proteins which are domains of the
polypeptide, e.g. share a binding or biological activity
with the full length parent polypeptide.

Optionally, the method comprises making a library of
soluble protein fragments or domains and contacting the
fragments or domains with one or more candidate compounds
to determine whether one or more of the candidate
compounds binds to and/or modulates an activity of a
protein fragment or domain present in the library. The
candidate compounds may be small molecules or
alternatively candidate polypeptide binding partners,
e.g. the method can be used to investigate
ligand-receptor binding, enzyme-substrate binding,
antibody-antigen binding, protein-ligand binding or
protein-nucleic acid binding. In still further
embodiments, two or more libraries of soluble protein
fragments or domains can be crossed to determine whether
binding or modulation of activity occurs between members
of the libraries. By way of example, in this embodiment
of the invention, libraries of domains of two proteins
can be made to determine which portions of those proteins
are involved in binding and biological activity.

In this aspect of the present invention, the nucleic acid
fragments are introduced into expression vector(s) to
produce a library of different DNA fragment expression
constructs, protein expression is induced, and the
derived protein is then treated in a novel approach that
selectively removes insoluble and/or soluble misfolded
and/or non-specifically aggregated protein fragments,
allowing selective detection and purification of the
soluble, folded, unaggregated or specifically aggregated
protein fragments.

The approach makes use of the empirical observation that
the process of purification of affinity tagged (such as
hexahistidine tagged) proteins by affinity chromatography
(such as metal affinity chromatography) is strongly
selective for soluble, folded proteins. Selection occurs
at several stages in the purification method, including:
loss of insoluble protein at filtration or centrifugation
steps; loss of weakly soluble, misfolded or
non-specifically aggregated protein by precipitation or
non-specific binding to various surfaces, such as plastic
and glass surfaces, at all stages of purification; and
loss of misfolded or non-specifically aggregated protein
by failure to adsorb to affinity media and/or loss at
washing steps. In our studies, affinity tags, such as
the hexa-histidine tag, appear to display considerably
lower accessibility to affinity chromatographic media
when attached to misfolded, aggregated and/or insoluble
target proteins than when attached to stably-folded,
un-aggregated, soluble target proteins. This selectivity
is likely to result in part from differences in the
degree of steric hindrance of binding to affinity media,
resulting from the properties of the target protein (e.g.
soluble vs. insoluble, folded vs. misfolded,
non-specifically aggregated vs. un-aggregated or
specifically aggregated). In this novel method, the DNA
fragment expression library is induced and screened for
soluble protein expression on the basis of the
selectivity of affinity purification media for binding of
folded, soluble tagged proteins over misfolded, insoluble
or aggregated tagged proteins.

In some embodiments, the blunt-ended DNA fragments may be
operationally linked to DNA sequences, such as an
expression vector, comprising sequences responsible for
control of transcription and translation of the gene
fragments and optionally sequences encoding affinity tag
peptide sequences and optionally sequences for
replication of the derived DNA constructs in host cells.

In some embodiments the library of blunt-ended gene
fragments is ligated into a suitable expression vector
using conventional blunt-ended ligation methods.

Alternatively, the blunt-ended gene fragments are cloned
into a suitable expression vector. An inducible
expression vector may be used such as those based on the
pET series in which the restriction fragments can be
inserted between the T7 promoter and start codon at the
5' end, and stop codons and transcription terminator at
the 3' end. Different versions of the vector may be
constructed, to include an affinity tag (e.g. a His6-tag)
and an optional protease cleavage site at the N-terminus
or C-terminus of the expressed fragment. A number of
different vectors may be employed to provide start and
stop codons in all three reading frames. The procedures
described here are not limited to the use of the His6-tag,
and allow for the use of alternative tags and/or
development of alternative short tags compatible with
fluorescence or FRET-based protein detection strategies,
for example. The expression vectors constructed above
constitute a gene fragment expression library. This
library is then transformed into host cells and the
transformed cells are then spread on to selection media
plates.

Several hundreds or thousands of individual colonies may
then be picked from the selection media plates and
transferred to multi-well growth plates containing
suitable growth medium. Several hundreds or thousands of
clones may be analysed, so all subsequent stages may
be processed in parallel in multi-well formats
on a multi-well plate format liquid-handling
robot. Plates are incubated at 15-37°C overnight, and
aliquots transferred into a second plate for growth for
2-3 hours. Optionally, expression may be induced by
addition of inducer molecules or temperature change, and
cultures grown for a further period post-induction.
Alternatively, a constitutive promoter system may be
utilised. Cell growth is monitored by optical density
measurement. The cells are then lysed and contacted
with appropriate affinity chromatography media such as
metal chelate media in conditions under which insoluble
or soluble misfolded protein molecules are removed by
precipitation or adsorption onto surfaces, such that only
soluble folded protein fragments are efficiently
purified. The purified soluble protein fragments are
analysed with respect to concentration and covalent
structural integrity.

Preferably, the expressed proteins are released for
separation under non-denaturing conditions, e.g. by
enzymes or non-denaturing detergents. Thus, host cells
such as induced bacterial cells are lysed using lysozyme
and non-denaturing detergents, and the lysates applied to
a multi-channel filter system (e.g. Qiagen TurboFilter)
that removes unbroken cells, cell debris and insoluble
material. Alternatively, the lysates may be clarified by
centrifugation. The clarified lysates containing the
soluble contents of the induced cells are then purified
in parallel in multiwell format by affinity
chromatography (e.g. metal affinity chromatography) and
assayed by anti-tag immunoblot or ELISA, SDS-PAGE and


wo 03/040391 



PCT/GB02/05075 



mass spectrometry and other methods known to those
skilled in the art. This combination of readouts
guarantees high sensitivity (blot or ELISA), assessment
of purity (SDS-PAGE) and validation of the molecular
composition, in addition to quantifying the protein
expression level. In an alternative configuration of
this embodiment, multiple clones are individually picked
from the selective media plate and then cultured together
in selective liquid media and processed together at all
subsequent steps in order to reduce the total number of
parallel operations to be performed. The chance of any
one fragment of the appropriate size range corresponding
to a folded domain and therefore giving a positive
readout is likely to be 0.01-1%. In this context, when a
pool of clones gives a positive readout then each
original clone present in the pool or subpool is
reprocessed to identify which clone(s) produced the
positive readout.
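
The saving offered by pooling can be estimated with a short
calculation. The following is a minimal sketch, assuming that
clones are independent, that each clone is positive with the
same probability p (taken from the 0.01-1% range given above),
and that every clone in a positive pool is re-assayed once.
The function names are hypothetical and the model is not part
of the described method.

```python
def pool_positive_probability(p_clone, pool_size):
    """Probability that a pool of pool_size clones contains at least
    one positive clone, assuming independent clones."""
    return 1.0 - (1.0 - p_clone) ** pool_size


def expected_assays(p_clone, pool_size, n_clones):
    """Expected number of purification assays for one round of pooling
    followed by re-assay of every clone in each positive pool."""
    n_pools = -(-n_clones // pool_size)  # ceiling division
    p_pos = pool_positive_probability(p_clone, pool_size)
    return n_pools + n_pools * p_pos * pool_size


if __name__ == "__main__":
    for p in (0.0001, 0.001, 0.01):
        print(p, expected_assays(p, pool_size=10, n_clones=1000))
```

With a low positive rate most pools are negative, so the
expected number of assays stays close to the number of pools
rather than the number of clones.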

In a further alternative embodiment, all colonies from
the selection media plates may be pooled and cultured in
a single vessel containing selective liquid media as
described above with respect to temperature and induction
of expression, before cell lysis and purification by
affinity chromatography. In this embodiment, the
purified protein mixture is then analysed as described
above and is likely to be found to contain multiple
soluble protein fragments, which can be identified by
protein sequencing and/or by fragmentation mass
spectrometry. The coding DNA sequences corresponding to
the protein fragments identified are then amplified by
PCR and cloned into expression vectors using established
methods known to those skilled in the art and used for
large-scale preparation of the protein fragment. In this
context different versions of expression vectors may be
constructed, to include an affinity tag (e.g. His6-tag)
and an optional protease cleavage site at the N-terminus
or C-terminus of the expressed fragment.

Once clones that express soluble protein fragments have
been identified, these clones are then cultured on a
larger scale with optional optimisation of expression,
and processed as described above, before purification
employing the affinity tag, e.g. employing chromatography
media and methods well known to those skilled in the art.
The purified soluble protein fragments are analysed with
respect to concentration, covalent structural integrity,
tertiary structural integrity and biological and/or
enzymatic activity using methods well known to those in
the art.

One embodiment of this method seeks to identify soluble
fragments of an extracellular protein or extracellular
domains of a transmembrane or integral membrane protein
that are suitable for high-level expression and secretion
in bacterial systems. In this embodiment, the library of
nucleic acid fragments is cloned into an expression
vector that fuses a bacterial periplasmic export signal
(such as OmpA) and signal peptidase cleavage site to the
N-terminus of the expressed protein fragment. An
affinity tag can optionally be included following the
signal peptidase site or at the C-terminus of the
expressed protein fragment. Bacterial colonies
expressing these protein fragments are treated with
gentle osmotic shock to release proteins from the
bacterial periplasmic space, with minimal release of
proteins from the cytoplasm. The periplasmic contents
and bathing culture medium are then filtered and
contacted with affinity resins as in the basic



methodology. In this embodiment only those protein
fragments that were efficiently secreted into the
periplasmic space, were proteolytically released from the
signal peptide, and were soluble and unaggregated
following secretion from the cells or after osmotic
shock, are efficiently purified and will give strong
anti-tag signals in immunoblot or ELISA assays.

A further embodiment of the method seeks to identify
candidate surface proteins from bacteria, suitable for
vaccine development. In this embodiment, the method for
identification of soluble fragments suitable for
high-level expression and secretion in bacterial systems
described above is applied to screening a DNA fragment
library derived from part of, or an entire, bacterial
genome, generated by some type of DNA fragmentation
method. DNA fragments from such a library will be cloned
into the expression vector for periplasmic export, and
colonies screened for expression of soluble tagged
protein fragments in culture medium and periplasmic
extract. Those expressed protein fragments that give
strong anti-tag signals will be those that were
efficiently secreted into the periplasmic space, were
proteolytically released from the signal peptide, and were
soluble and unaggregated. It is most likely that protein
fragments that fulfil these criteria efficiently will
derive from extracellular proteins, or from the
extracellular domains of transmembrane or integral membrane
proteins, encoded by the bacterial genome being screened.
Such proteins would have a high likelihood of being
visible to the immune system of an organism infected by
the bacterium being screened, and would therefore be good
candidates for vaccine development.

In a further variation, the method can be used to
identify stable and soluble complexes formed between
fragments of different proteins or between fragments of a
single protein. In one embodiment of this variation, two
or more DNA fragment libraries are co-expressed in the
same bacterial cell, either from the same vector, or from
different compatible vectors simultaneously present. The
libraries are cloned into the expression vector or
vectors as in the basic method, but so that sequences
encoding different affinity 'tags' are attached to the
fragments encoded by the different DNA libraries. As in
the basic method, bacterial cells are lysed and filtered,
and contacted with affinity media that is specific to the
(primary) affinity tag attached to only one library to
select for soluble, folded and unaggregated protein
fragments. As in the basic method, protein levels are
assayed by ELISA or immunoblot, but using antibodies
directed against the (secondary) affinity tag (or tags)
attached to the other library (or libraries). Strong
signals against a secondary tag will indicate the
presence of a fragment expressed from one library that
was efficiently transported by and formed a stable
non-aggregated complex with a fragment from the primary
library whose 'tag' was utilised for selection.

In a further aspect, the methods described herein may be
combined to provide a method of producing a library of
nucleic acid fragments, the nucleic acid fragments
encoding one or more portions of a polypeptide, and
identifying fragments encoding soluble protein domains,
the method comprising:

amplifying a nucleic acid sequence encoding the
polypeptide in the presence of a non-native nucleotide so
that the non-native nucleotide is incorporated into the
amplified product nucleic acid sequence at a frequency
related to the relative amounts of the non-native
nucleotide and its corresponding native nucleotide, if
present;

contacting the product nucleic acid sequence with
one or more reagents capable of recognising the presence
of the non-native nucleotide and cleaving the product
nucleic acid sequence or excising the non-native
nucleotide, thereby producing nucleic acid sequences
encoding fragments of the polypeptide;

expressing a library of the nucleic acid fragments
to produce the protein domains encoded by the fragments,
wherein the protein domains are expressed as fusions with
an affinity tag; and

separating soluble proteins using the affinity tag.

Embodiments of the present invention will now be 
described in more detail by way of example and not 
limitation with reference to the accompanying figures. 

Brief Description of the Figures

Figure 1 shows a representation of the fragmentation of a
single molecule of PCR product with a low level of dUTP
incorporated. Since the position at which the dUTP is
incorporated is different in different PCR product
molecules, the position at which cleavage occurs is
different and will therefore result in sampling of all
possible positions in a particular coding sequence. A
library of DNA fragments is therefore produced that
samples all possible positions, representing all possible
fragments within a certain size range that is determined
by the ratio of dUTP:TTP used in the initial
amplification reaction.

Figure 2 shows a gel showing the nucleic acid fragments
produced when the method described herein was applied to
exon 11 of BRCA2, eIF2, NS5 and p85nic.

Figure 3 shows the effect of UDG, APE and β-elimination
treatment on NS5 PCR products comprising different levels
of dUTP incorporation.

Figure 4 shows the PCR product produced after
amplification of p85nic with 1% dUTP before and after
fragmentation.

Figure 5 shows agarose gel electrophoresis of restriction
digests of pCRBlunt/p85nic fragment clones and
pCRT7-NT/p85nic fragment clones.

Figure 6 shows the analysis of the selectivity of the
purification method for soluble vs insoluble protein.
Samples of cell extract, Turbo-filtered cell extract and
Ni-NTA eluate from purification trials of soluble cStil,
insoluble full length Gsk and the insoluble catalytic
domain of Gsk were run on SDS-PAGE.

Detailed Description
Introduction

We have developed a method for identification of protein
domains comprising two main steps: 1) production of a
library of expression vectors that contain DNA fragments
of defined size range that have been sampled essentially
randomly from a particular target coding sequence; 2)
screening of the library for clones that express soluble
protein domains. The first step employs an enzymatic
fragmentation method based on the DNA base-excision
repair pathway and the second step makes use of a protein
purification method that is selective for soluble protein
domains over insoluble protein fragments. The two key
novel aspects of the methodology have been tested in two
separate pilot feasibility studies: one involving the
novel gene fragmentation aspects of the technology and
another involving testing of the selectivity of the
protein purification method for soluble proteins with
tertiary structural integrity. These studies demonstrate
that the DNA fragmentation method is efficient and
reproducible, generating blunt-ended DNA fragments
suitable for cloning. In addition, the fragment size
range produced is found to be reproducible and solely a
function of the ratio of dUTP:TTP used in the
amplification of the PCR product. In a second aspect,
these studies show the present protein purification
method to be highly selective for soluble vs. insoluble
protein and therefore suitable for screening of libraries
of clones in order to identify those that produce soluble
protein domains.

Materials and Methods
PCR

Initially four coding sequences were identified as
potential targets for application of the "Domain hunting"
method: human BRCA2 exon 11, yeast elongation initiation
factor 2, Dengue virus type 1 NS5 and the
N-SH2-Inter-SH2-C-SH2 region of the human signal
transduction protein p85. Oligonucleotide primers were
designed and synthesised for PCR amplification of each
coding sequence. PCR was then performed using Taq DNA
polymerase according to the manufacturer's instructions
except that dGTP, dCTP and dATP were used at a
concentration of 200 µM each, and TTP and dUTP were used
at concentrations of 198 µM and 2 µM respectively. PCR was
therefore performed in the presence of a ratio of 99% TTP
to 1% dUTP, allowing incorporation of dUTP at an average
of ~1% at any particular thymidine nucleotide position in
the sequences. Thirty cycles of PCR were performed for
each template and an annealing temperature 5°C below the
theoretical melting temperature was used for each
reaction. The extension time used for each reaction was
60 seconds per kilobase of full-length product.
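
The arithmetic behind these reaction conditions can be sketched
as follows. This is an illustrative calculation only; the
function names are hypothetical, and the only figures taken
from the text are the 198 µM / 2 µM split and the 60 seconds
per kilobase extension rule.

```python
def dutp_fraction(ttp_uM, dutp_uM):
    """Fraction of the combined TTP + dUTP pool that is dUTP."""
    return dutp_uM / (ttp_uM + dutp_uM)


def split_thymidine_pool(total_uM, fraction):
    """Return (TTP, dUTP) concentrations in uM for a chosen dUTP fraction."""
    dutp = total_uM * fraction
    return total_uM - dutp, dutp


def extension_time_s(product_length_bp, seconds_per_kb=60):
    """Extension time in seconds at 60 s per kilobase of full-length product."""
    return product_length_bp / 1000 * seconds_per_kb


if __name__ == "__main__":
    print(dutp_fraction(198, 2))            # 198 uM TTP with 2 uM dUTP
    print(split_thymidine_pool(200, 0.10))  # a 90:10 TTP:dUTP mix
    print(extension_time_s(1500))           # a 1.5 kb product
```

The same helper can be used to set up the other TTP:dUTP ratios
mentioned in the Results by changing the fraction argument.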

Fragmentation of PCR products

Digestion with UDG and APE enzymes:

The fragmentation protocol is summarised in Figure 1.

The above NS5 and p85nic PCR products were treated with
UDG (New England Biolabs Inc.) and APE enzymes as below.
Nth and NFO were over-expressed in E. coli and purified
to homogeneity. Two different APE enzymes were assessed
for their cleavage efficiency: NFO and Nth.

PCR products were purified by agarose gel electrophoresis
and gel extraction according to the manufacturer's
instructions (Qiagen Inc.) and then incubated with 1 U of
UDG per microgram of DNA and 2 µl of 2 µg/µl APE (either
Nth or NFO) per microgram of DNA at 37°C for 60 mins.
Spermine tetrahydrochloride (Calbiochem Inc.) was then
added to 0.2 mM final concentration before incubating at
37°C for 30 mins and then 70°C for 15 mins and 4°C for 2
mins. The product was then purified (PCR purification
kit, Qiagen Inc.) and the purified DNA eluted in 1 mM
Tris.HCl pH 8.0. The product was then incubated with 1
unit of S1 nuclease per microgram of DNA at 37°C for 60
mins. The product was then purified by 1% agarose gel
electrophoresis and a block of gel corresponding to DNA
products of 300-600bp was excised and purified by gel
extraction as above. The above product was then treated
with shrimp alkaline phosphatase using one unit of enzyme
per microgram of DNA at 37°C for one hour before adding
the same quantity of fresh enzyme and incubating for a
further hour. The reaction was then heated to 65°C for
15 minutes to totally inactivate the alkaline
phosphatase. The product was then purified (PCR
purification kit, Qiagen Inc.) and the purified DNA
eluted in 1 mM Tris.HCl pH 8.0. This DNA was then used
for blunt-end cloning as described below. Alternatively,
for TA cloning using the pCRT7-NT-TOPO vector
(Invitrogen Inc.) a final incubation with Taq DNA
polymerase was performed to add a single adenine
nucleotide to the 3' ends of the products. This was
performed by incubating the product for 15 minutes at
72°C in the presence of a conventional PCR reaction
mixture, well known to those skilled in the art, but
without primers.
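
Because every enzyme in this protocol is dosed per microgram of
DNA, scaling a reaction is a matter of multiplication. The
sketch below is illustrative only: the function name and
dictionary keys are hypothetical, and the APE figure assumes
the stock concentration of 2 µg/µl read from the passage above.

```python
def fragmentation_reagents(dna_ug):
    """Enzyme amounts for dna_ug micrograms of PCR product, using the
    per-microgram doses quoted in the protocol text."""
    return {
        "UDG_units": 1.0 * dna_ug,        # 1 U UDG per ug DNA
        "APE_ul": 2.0 * dna_ug,           # 2 ul of 2 ug/ul APE per ug DNA
        "S1_units": 1.0 * dna_ug,         # 1 unit S1 nuclease per ug DNA
        "SAP_units_total": 2.0 * dna_ug,  # 1 unit per ug, added twice
    }


if __name__ == "__main__":
    for name, amount in fragmentation_reagents(10).items():
        print(name, amount)
```

For a ~10 µg preparation, as used in the scale-up described in
the Results, this gives the amounts to add at each enzymatic
step; buffer volumes and spermine are not covered by the sketch.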

Cloning of the DNA fragments

~100 ng of the above fragmented p85nic coding sequence was
cloned using three different vectors (pCRBlunt,
pCR4Blunt-TOPO and pCRT7-NT-TOPO) according to the
manufacturer's protocol (Invitrogen Inc.). The
transformation reactions were plated onto LB agar plates
containing either ampicillin (pCRT7-NT-TOPO) or kanamycin
(pCRBlunt, pCR4Blunt-TOPO) depending on the vector used
for transformation.

Analysis of clones

Plasmid minipreps were performed (Qiagen Inc.) for 40
clones derived from pCRT7-NT-TOPO/p85 fragment
transformations and for ~20 clones derived from


pCRBluntTOPO and ~20 clones derived from pCRBlunt.
pCRT7-NT-TOPO/p85 fragment derived plasmids were digested
with EcoRI and BamHI (New England Biolabs Inc.) and
analysed by 1% agarose gel electrophoresis.

Plasmid samples were DNA sequenced using the Cambridge
University Biochemistry Dept. DNA sequencing service and
results analysed using Vector NTI (Informax Inc.).

Selective purification of folded protein

50 ml cultures of E. coli BL21(DE3) cells expressing
soluble C-terminal domain of Stil (cStil) (REF), or
insoluble Gsk3 (full length and catalytic domain) (REF)
were pelleted and resuspended in 5 ml of lysis buffer (50
mM NaH2PO4, 300 mM NaCl, 1 mM imidazole pH 8.0). 1 mg ml⁻¹
lysozyme and 10 µg of RNase A were added and the lysate
incubated on ice for 30 min. 0.5 ml of the lysate was
then passed through a Qiagen TurboFilter (8 strip) as
described by the manufacturer (Qiagen Inc.). 200 µl of
the cleared lysates were added to 20 µl of Ni-NTA
magnetic beads in 96 well microtitre plates. The plates
were then shaken for 60 min, the beads washed twice in
lysis buffer containing 10 mM imidazole, and bound
protein eluted with 50 µl lysis buffer containing 300 mM
imidazole. 20 µl aliquots of the whole cell extract, the
turbo-filtered extract and eluate from the beads were
analysed by SDS-PAGE.

Results

DNA fragmentation trials

We have performed computer modelling experiments to
predict the size of fragments that would be produced by
the present DNA fragmentation method for different levels
of dUTP incorporation. These predicted that 1% dUTP
incorporation would produce a fragment size range with a
distribution centering around 500bp. Four different
coding sequences ranging in size from ~1-3.1kb were
therefore amplified by PCR using Taq DNA polymerase in
the presence of 1% dUTP, demonstrating that PCR is highly
efficient under these conditions (Figure 2).
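
The kind of modelling referred to above can be illustrated with
a small Monte Carlo sketch. This is not the model used in these
studies, whose details are not given here. It assumes a random
template with equal base composition, independent incorporation
of dUTP at each thymine position on either strand with
probability equal to the dUTP fraction, and a double-strand
break at every incorporated position after the complete
treatment. All function names are hypothetical, and the figures
it prints depend on these assumptions.

```python
import random


def simulate_fragments(length_bp, dutp_fraction, rng):
    """Fragment lengths for one simulated PCR product molecule."""
    breaks = []
    for pos in range(1, length_bp):
        base = rng.choice("ACGT")
        # T on the top strand, or A on the top strand (T on the bottom
        # strand), is a position where dUTP could have been incorporated.
        if base in "TA" and rng.random() < dutp_fraction:
            breaks.append(pos)
    edges = [0] + breaks + [length_bp]
    return [b - a for a, b in zip(edges, edges[1:])]


def mean_fragment_length(length_bp, dutp_fraction, molecules=2000, seed=0):
    """Number-average fragment length over many simulated molecules."""
    rng = random.Random(seed)
    fragments = []
    for _ in range(molecules):
        fragments.extend(simulate_fragments(length_bp, dutp_fraction, rng))
    return sum(fragments) / len(fragments)


if __name__ == "__main__":
    for fraction in (0.01, 0.10):
        print(fraction, round(mean_fragment_length(3000, fraction)))
```

The sketch reports a number-average length, whereas band
intensity on a gel is weighted by DNA mass, so the two are not
directly comparable; the qualitative trend of shorter fragments
at higher dUTP fractions is the point of the example.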

We then compared NS5 PCR products amplified using
different ratios of TTP:dUTP (100:0, 99:1 and 90:10) by
treatment with UDG and APE and β-elimination (Figure 3).
This indicates that, as expected, the PCR products with no
dUTP incorporated are unaffected by this treatment while
1% dUTP products show some slight evidence of
fragmentation and 10% dUTP products show considerable
evidence of fragmentation. These results are as expected
since this treatment of 1% dUTP products with UDG and APE
and β-elimination should introduce single-stranded
one-nucleotide gaps in the dsDNA at ~500bp intervals on
average. Similarly, treatment of 10% dUTP products should
produce gaps at intervals of around 50bp on average. On
agarose gel electrophoresis therefore the 1% dUTP
products would migrate in essentially the same way as
uncut 100% TTP products since the 65°C 15 minute
incubation step used for β-elimination would not be
expected to cause significant melting of strands with
500bp overlaps between single-nucleotide gaps. The 10%
dUTP product would however be expected to have melted
significantly and then reannealed to produce a mixture of
smaller annealed products consistent with that observed.

The whole fragmentation method (Figure 1) has been
applied to 1% dUTP p85nic PCR product (Figure 2). This
has been repeated using different APE enzymes and with
different lengths of incubation, always yielding the same
size distribution of product ranging from ~100bp to 1-2kb
with maximum band intensity centred around 500bp as
predicted (Figure 4). This process has been scaled up
reproducibly for fragmentation of ~10 µg of DNA,
indicating that generation of quantities of product
sufficient for production of large libraries of clones
according to the present invention is feasible.

Cloning

Transformation of E. coli cells with p85nic fragment
cloning reactions was successful using three different
cloning approaches: pCRBlunt ligation; pCR4Blunt-TOPO
cloning; and pCRT7-NT-TOPO cloning. TOPO cloning of
fragmented p85nic insert DNA into both pCR4Blunt-TOPO and
pCRT7-NT-TOPO produced around 250 colonies per 100 ng of
insert used. Blunt-end ligation of fragmented p85nic DNA
to pCRBlunt produced ~1000 colonies at 16°C and 120
colonies at 37°C per 100 ng of fragmented DNA. These
results indicate that a substantial proportion of the DNA
fragments produced as described are blunt ended as
expected. Cloning of DNA fragments produced by the
method using the above cloning methods is therefore of
sufficiently high efficiency to allow generation of
libraries of thousands of clones.

Characterisation of cloned p85nic fragments

Restriction characterisation of plasmid DNA derived from
clones generated by both TOPO cloning and blunt end
ligation indicated that >90% of clones contained an
insert and the distribution of the sizes of inserts
correlated closely with the size range of p85nic DNA
fragments used for cloning (Figure 5). DNA sequencing of
the cloned DNA inserts suggests that the fragments appear
to be sampled in an essentially random manner from the
p85nic coding sequence. No nucleotide substitutions have
yet been detected by DNA sequencing, indicating that as
expected the method is not inherently mutagenic. DNA
sequencing of a large number of clones is necessary in
order to accurately measure the randomness of sampling
and the frequency of mutation.

Selective purification of folded protein

In order to assess the selectivity of the purification
method for folded protein versus unfolded or aggregated
protein we have applied the purification method to a set
of well-characterised proteins with known solubility
properties. Cultures of E. coli BL21(DE3) cells
expressing soluble C-terminal domain of Stil (cStil), or
insoluble Gsk3 (full length and catalytic domain) were
harvested and the cells lysed enzymatically before
passing through a Qiagen TurboFilter as described by the
manufacturer (Figure 6). This step cleared the cell
lysates and significantly reduced the amount of the
insoluble Gsk in the lysate, but did not affect the level
of the soluble cStil. Further reduction of the quantity
of insoluble constructs was seen following Ni-NTA
magnetic bead purification. The cleared lysates were
then purified using Ni-NTA magnetic beads in 96 well
microtitre plates. The whole cell extract, the
turbo-filtered extract and the Ni-NTA eluate were then
analysed by SDS-PAGE, showing that the recovery of the
soluble cStil is at least 100 times more efficient than
that of the insoluble constructs. The difference in the
level of recovery of soluble vs. insoluble recombinant
protein demonstrates that this purification method is
highly selective for soluble folded protein over
insoluble/misfolded protein over a wide dynamic range.
This purification approach will therefore allow sensitive
detection of soluble folded protein fragments or domains
over insoluble misfolded fragments and therefore allow
identification of regions of protein sequence that
correspond to folded protein.

Conclusions

The gene fragmentation study provided verification of
incorporation of dUTP into the target gene by PCR,
fragmentation of the target gene, robust control of the
range of fragment sizes generated and efficient cloning
of the fragments. We have tested the efficiency of PCR
in the presence of dUTP for four different coding
sequences. We have then compared the behaviour of PCR
products prepared in the presence of different ratios of
TTP:dUTP by treatment with uracil DNA glycosylase (UDG)
and two different apurinic/apyrimidinic endonucleases
(APE). This demonstrated that fragmentation occurs only
with uracil-containing PCR products and that the size of
the fragments produced corresponds directly to the
dUTP:TTP ratio used in the PCR amplification step. We
selected the p85nic coding sequence for further analysis
by the above enzymes and also for subsequent treatment
with spermine and S1 nuclease. This demonstrated that
fragments of p85nic of the size range predicted in theory
for 1% dUTP incorporation were indeed produced. This
also showed that, as predicted, these fragments were blunt
ended since they could be cloned efficiently by blunt end
cloning methods. A method for identification of soluble
protein fragments or domains that can be efficiently
expressed and purified from bacteria has been established
and validated using several targets of well-characterised
solubility properties. Coupling of the DNA
fragmentation/cloning aspects with the soluble protein
domain identification aspects of the method therefore
provides a holistic method for generation of vectors for
high-level soluble expression of newly discovered protein
domains. These vectors can then be used directly for
production of large quantities of soluble protein domains
for structural and functional studies without the need
for any subsequent genetic manipulation or optimisation
of protein expression or purification.



Claims:

1. A method of producing a library of nucleic acid
fragments, the nucleic acid fragments encoding one or
more portions of a polypeptide, and identifying fragments
in the library encoding soluble protein domains, the
method comprising:

amplifying a nucleic acid sequence encoding the
polypeptide in the presence of a non-native nucleotide so
that the non-native nucleotide is incorporated into the
amplified product nucleic acid sequence at a frequency
related to the relative amounts of the non-native
nucleotide and its corresponding native nucleotide, if
present;

contacting the product nucleic acid sequence with
one or more reagents capable of recognising the presence
of the non-native nucleotide and cleaving the product
nucleic acid sequence or excising the non-native
nucleotide, thereby producing nucleic acid sequences
encoding fragments of the polypeptide;

expressing a library of the nucleic acid fragments
to produce the protein domains encoded by the fragments,
wherein the protein domains are expressed as fusions with
an affinity tag; and

separating soluble proteins using the affinity tag.

25 

2. A method for producing a library of nucleic acid
fragments, the nucleic acid fragments encoding one or
more portions of a polypeptide, the method comprising:

amplifying a nucleic acid sequence encoding the
polypeptide in the presence of a non-native nucleotide so
that the non-native nucleotide is incorporated into an
amplified product nucleic acid sequence at a frequency
related to the relative amounts of the non-native
nucleotide and its corresponding native nucleotide, if
present;

contacting the product nucleic acid sequence with
one or more reagents capable of recognising the presence
of the non-native nucleotide and cleaving the product
nucleic acid sequence or excising the non-native
nucleotide, thereby producing nucleic acid sequences
encoding fragments of the polypeptide.

3. The method of claim 1 or claim 2, wherein the step
of amplifying the nucleic acid sequence is carried out
by PCR using a non-native deoxynucleotide, either
alone or in a mixture of the non-native and native
nucleotide.

4. The method of claim 3, wherein the non-native
nucleotide is uracil or 3-methyl adenine.

5. The method of any one of the preceding claims,
wherein the nucleic acid sequence is present in a cDNA
library, an RNA library or a sample of genomic DNA.

6. The method of any one of the preceding claims, 
wherein the nucleic acid fragments of the polypeptide are 
between 200 and 1200 nucleotides in length. 

7. The method of any one of the preceding claims, 
wherein the nucleic acid sequence is sampled on average 
about every second nucleotide to produce the nucleic acid 
fragments.

8. The method of any one of the preceding claims,
wherein the reagent capable of recognising and cleaving
the product nucleic acid sequence at the non-native
nucleotide is an enzyme which can recognise the presence
of the non-native nucleotide and cleave the nucleic acid
sequence at or around the modified nucleic acid sequence.

9. The method of claim 8, wherein the enzyme is a DNA
glycosylase or an endonuclease.

10. The method of any one of the preceding claims,
wherein the non-native nucleotide is deoxyuracil and the
enzyme is apurinic/apyrimidinic endonuclease (APE),
catalysed by uracil-DNA glycosylase (UDG).

11. The method of any one of the preceding claims,
further comprising amplifying the nucleic acid fragments.

12. The method of any one of the preceding claims,
wherein in the library of protein domains, the protein
domains comprise a constant portion and a portion sampled
by the amplifying and contacting steps.

13. The method of any one of the preceding claims,
further comprising ligating the nucleic acid fragments
into expression vector(s).

14. The method of claim 13, further comprising
transforming host cells with the expression vectors to
produce a library of host cells capable of expressing
fragments of the polypeptide.

15. The method of claim 14, further comprising
expressing the nucleic acid sequences encoding fragments
of the polypeptide and optionally isolating the
polypeptide fragments thus produced.

16. A method of identifying soluble protein domains, the
method comprising:

expressing a library of nucleic acid fragments to
produce the protein domains encoded by the fragments,
wherein the protein domains are expressed as fusions with
an affinity tag; and

separating soluble proteins using the affinity tag.

17. The method of any one of the preceding claims,
wherein the polypeptide fragments are expressed to
include a protease cleavage site.

18. The method of any one of the preceding claims,
wherein the polypeptide fragments are expressed to
include an affinity tag.

19. The method of claim 18, wherein the affinity tag is 
a peptide which is less than 15 amino acids in length. 

20. The method of claim 18 or claim 19, wherein the
affinity tag is fused to the C-terminus of the protein
fragments.

21. The method of any one of claims 18 to 20, wherein
the affinity tag is polyhistidine, a Flag or Glu epitope,
an S-tag, calmodulin binding peptide or ribonuclease S.

22. The method of claim 21, wherein the affinity tag is 
a His6-tag.

23. The method of any one of claims 14 to 22, further
comprising releasing the soluble protein domains from the
cells.

24. The method of claim 23, wherein the step of
releasing the protein is carried out under non-denaturing
conditions.

25. The method of claim 24, wherein the non-denaturing
conditions comprise the use of enzymes or non-denaturing
detergents.

26. The method of any one of claims 14 to 25, further
comprising filtering out unbroken cells, cell debris and
insoluble material.

27. The method of any one of claims 14 to 25, further
comprising clarifying the lysates by centrifugation.

28. The method of any one of claims 14 to 27, further 
comprising purifying cells transformed with different 
proteins in parallel by affinity chromatography. 

29. The method of any one of the preceding claims,
wherein the step of separating the soluble protein
domains is carried out by contacting the library of
protein domains with a solid phase having a binding
partner of the affinity tag immobilised thereon.

30. The method of claim 29, wherein the binding partner
is:

(a) metal ions such as Ni2+ or Co2+ for binding a
polyhistidine affinity tag; or

(b) anti-Flag antibodies for binding a Flag or Glu
epitope affinity tag; or

(c) streptavidin for binding an S-tag affinity tag; or

(d) calmodulin in the presence of Ca2+ for binding a
calmodulin binding peptide affinity tag; or

(e) aporibonuclease S for binding a ribonuclease S
affinity tag.

31. The method of any one of the preceding claims,
further comprising assaying for the presence of soluble
protein fragments.

32. The method of claim 31, wherein the step of assaying
is carried out using anti-tag ELISA, SDS-PAGE or
LC-ESI-MS.

33. The method of claim 31 or claim 32, wherein the step
of assaying for the soluble protein domains comprises
quantifying the protein expression level of one or more
of the protein domains.

34. The method of any one of the preceding claims,
further comprising identifying or sequencing the soluble
proteins.

35. The method of any one of the preceding claims,
further comprising contacting the library of fragments or
domains with:

(a) one or more candidate compounds to determine
whether the candidate compound binds to and/or modulates
an activity of a protein fragment or domain present in
the library; and/or

(b) one or more test proteins to determine whether
a protein fragment or domain present in the library binds
to and/or modulates an activity of the test protein.

36. The method of any one of the preceding claims,
further comprising contacting two or more libraries of
soluble protein fragments or domains to determine whether
binding or modulation of activity occurs between the
protein fragments or domains present in the libraries.

37. The method of claim 36, wherein the method is
employed to determine which portions of the proteins used
to construct the libraries are involved in binding and
biological activity.

38. The method of any one of claims 35 to 37, wherein
the method is used to determine binding between a ligand
and a receptor, an enzyme and a substrate, an antibody
and an antigen, or a small molecule and a protein.

39. A library produced by the method of any one of the
preceding claims.

40. The library of claim 39, which is:

(a) a library of nucleic acid fragments of a parent
nucleic acid sequence, wherein the nucleic acid fragments
have a size range between 200 and 1200 nucleotides in
length and are preferably sampled from the parent nucleic
acid sequence on average about every second nucleotide;
or

(b) a library of expression vectors which comprise
a plurality of the nucleic acid fragments as set out in
(a), wherein each fragment is ligated to a nucleic acid
sequence encoding an affinity tag and optionally one or
more further sequences to direct the expression of the
nucleic acid fragment and the affinity tag; or

(c) a library of host cells transformed with the
expression vectors as defined in (b); or

(d) a library of polypeptide fragments produced by
expressing the nucleic acid sequences set out in step
(a), wherein each polypeptide is coupled to an affinity
tag.